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Abstract 

We introduce local expectation gradients which is a general purpose stochastic variational infer¬ 
ence algorithm for constructing stochastic gradients through sampling from the variational distribu¬ 
tion. This algorithm divides the problem of estimating the stochastic gradients over multiple varia¬ 
tional parameters into smaller sub-tasks so that each sub-task exploits intelligently the information 
coming from the most relevant part of the variational distribution. This is achieved by performing an 
exact expectation over the single random variable that mostly correlates with the variational parameter 
of interest resulting in a Rao-Blackwellized estimate that has low variance and can work efficiently 
for both continuous and discrete random variables. Furthermore, the proposed algorithm has interest¬ 
ing similarities with Gibbs sampling but at the same time, unlike Gibbs sampling, it can be trivially 
parallelized. 


1 Introduction 

Stochastic variational inference has emerged as a promising and flexible framework for performing large 
scale approximate inference in complex probabilistic models. It significantly extends the traditional vari¬ 
ational inference framework [SI [TJ by incorporating stochastic approximation iflOl into the optimization 
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of the variational lower bound. Currently, there exists two major research directions in stochastic varia¬ 
tional inference. The first attempts to deal with massive datasets by constructing stochastic gradients by 
using mini-batches of training examples (2][3]|. The second direction aims at dealing with the intractable 
expectations under the variational distribution that are encountered in non-conjugate probabilistic mod¬ 
els e Elia [mss ei. The unified idea in the second direction is that stochastic optimization can be 
carried out by sampling from the variational distribution. This results in a doubly stochastic estimation 
approach, where the mini-batch source of stochasticity (the first direction above) can be combined with 
a second source of stochasticity associated with sampling from the variational distribution. Furthermore, 
within the second direction there exist now two main sub-classes of methods: the first based on the log 
derivative trick [0 |8j [6l and the second based on the reparametrization trick [[5} j9j, 12]. The first ap¬ 
proach is completely general and can allow to apply stochastic optimization of variational lower bounds 
corresponding to arbitrary models and forms for the variational distribution. For instance, probabilistic 
models having both continuous and discrete latent variables can be accommodated by the log derivative 
trick approach. On the other hand, the reparametrization approach is specialised to continuous spaces 
and probabilistic models with differentiable joint distributions. 

In this paper, we are interested to further investigate the doubly stochastic methods that sample from 
the variational distribution. A challenging issue here is concerned with the variance reduction of the 
stochastic gradients. Specifically, while the method based on the log derivative trick is the most general 
one, it has been observed to severely suffer from high variance problems d [8l [6]] and thus it is only 
applicable together with sophisticated variance reduction techniques based on control variates. However, 
the construction of efficient control variates is a very challenging issue and in each application depends 
on the form of the probabilistic model. Therefore, it would be highly desirable to investigate whether it is 
possible to avoid using control variates altogether and construct simple stochastic gradient estimates that 
can work well for any probabilistic model. Notice, that while the reparametrization approach |[5l l ( ). 12] 
has shown to perform well without the use of control variates, it is at the same time applicable only to 
a restricted class of variational inference problems that involve probabilistic models with differentiable 
joint distributions and variational distributions typically taken from the location-scale family. 

Next, we introduce a general purpose algorithm for constructing stochastic gradients by sampling 


2 


from the variational distribution that has a very low variance and can work efficiently without the need of 
any variance reduction technique. This method builds upon the log derivative trick and it is based on the 
key observation that stochastic gradient estimation over multiple variational parameters can be divided 
into smaller sub-tasks where each sub-task requires using more information coming from some part of 
the variational distribution and less information coming from other parts. For instance, assume we have 
a factorized variational distribution of the form n"=i Qvi( x i) where v t is a local variational parameter and 
Xi the associated latent variable. Clearly, Vi determines the distribution over x r , and therefore we expect 
the latent variable x t to be the most important piece of information for estimating v, or its gradient. 
Based on this intuitive observation we introduce the local expectation gradients algorithm that provides 
a stochastic gradient over n, by performing an exact expectation over the associated random variable 
Xi while using a single sample from the remaining latent variables. Essentially this consist of a Rao- 
Blackwellized estimate that allows to dramatically reduce the variance of the stochastic gradient so that, 
for instance, for continuous spaces the new stochastic gradient is guaranteed to have lower variance than 
the stochastic gradient corresponding to the state of the art reparametrization method. Furthermore, the 
local expectation algorithm has striking similarities with Gibbs sampling with the important difference, 
that unlike Gibbs sampling, it can be trivially parallelized. 

The remainder of the paper has as follows: Section[2]discusses the main two types of algorithms for 
stochastic variational inference that are based on simulating from the variational distribution. Section 
[3] describes the proposed local expectation gradients algorithm. Section [4] provides experimental results 
and the paper concludes with a discussion in Section [5J 

2 Stochastic variational inference 

Here, we discuss the main ideas behind current algorithms on stochastic variational inference and par¬ 
ticularly doubly stochastic methods that sample from the variational distribution in order to approximate 
intractable expectations using Monte Carlo. Given a joint probability distribution p{ y,x) where y are 
observations and x are latent variables (possibly including model parameters that consist of random vari- 
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ables) and a variational distribution g v (x), the objective is to maximize the lower bound 

F(v) = Eq v(x) [logp(y, x) - log g v (x)], (1) 

= E 9v( x) [logp(y, x)] - E gv(x) [log q'v(x)] , (2) 

with respect to the variational parameters v. Ideally, in order to tune v we would like to have a closed- 
form expression for the lower bound so that we could subsequently maximize it by using standard opti¬ 
mization routines such as gradient-based algorithms. However, for many probabilistic models and forms 
of the variational distribution at least one of the two expectations in Q are intractable. For instance, 
if p( y, x) is defined though a neural network, the expectation E r/v(x;i [logp(y,x)] will be analytically 
intractable despite the fact that q v (x) might have a very simple form, such as a Gaussian, so that the 
second expectation in eq. ([2]) (i.e. the entropy of y v (v)) will be tractable. In other cases, such as when 
g v (v) is a mixture model or is defined through a complex graphical model, the entropic term will also be 
intractable. Therefore, in general we are facing with the following intractable expectation 

F(v) = E„ vM [/(x)] , (3) 

where /(x) can be either logp(y, x), — logg v (x) or logp(y, x) — logg v (x), from which we would like 
to efficiently estimate the gradient over v in order to apply gradient-based optimization. 

The most general method for estimating the gradient V v F(v) is based on the log derivative trick 
BE,I6]|. Specifically, this makes use of the property V v g v (x) = g v (x)V v logg v (x), which allows to 
write the gradient as 

V v E(v) = Eq v(x) [/(x)Vv log g v (x)] (4) 

and then obtain an unbiased estimate according to 

1 S 

-y'/(x<*>)V v log 9v (x<*>), (5) 

S —1 

where each x^ is an independent draw from q v (x). While this estimate is unbiased, it has been observed 
to severely suffer from high variance so that in practice it is necessary to consider variance reduction 
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techniques such as those based on control variates [0 111 HI . Despite this limitation the above framework 
is very general as it can deal with any variational distribution over both discrete and continuous latent 
variables. The proposed method presented in Section [3] is essentially based on the log derivative trick, 
but it does not suffer from high variance. 

The second approach is suitable for continuous spaces where /(x) is a differentiable function of x 
ttHEL 121. It is based on a simple transformation of ([3]) which allows to move the variational parameters v 
inside /(x) so that eventually the expectation is taken over a base distribution that does not depend on the 
variational parameters anymore. For example, if the variational distribution is the Gaussian jV(x|/z, LL ' ) 
where v = (n,L), the expectation in ^ can be re-written as F(n,L) = f Af(z\0,1)f(/u, + Lz)dz 
and subsequently the gradient over (//. L) can be approximated by the following unbiased Monte Carlo 
estimate 

1 S 

■a £ V ( „,n/fM + iz (>) ), (6) 

° S—1 

where each z ( ' s> is an independent sample from _A/"(z|0, 1). This estimate makes efficient use of the slope 
of /(x) which allows to perform informative moves in the space of (/r. L). For instance, observe that as 
L —> 0 the gradient over /r approaches V M /(/Lt) so that the optimization reduces to a standard gradient- 
ascent procedure for locating a mode of /(x). Furthermore, it has been shown experimentally in several 
studies [EllHIIJj that the estimate in (|6]) has relatively low variance and can lead to efficient optimization 
even when a single sample is used at each iteration. Nevertheless, a limitation of the approach is that it 
is only applicable to models where x is continuous and /(x) is differentiable. Even within this subset of 
models we are also additionally restricted to using certain classes of variational distributions 10 13. 

Therefore, it is clear that from the current literature is lacking a universal method that both can be 
applicable to a very broad class of models (in both discrete and continuous spaces) and also provide 
low-variance stochastic gradients. Next, we introduce such an approach.. 

3 Local expectation gradients 

Suppose that the n-dimensional latent vector x in the probabilistic model takes values in some space 
<Si x ... S n where each set S, can be continuous or discrete. We consider a variational distribution over 
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x that is represented as a directed graphical model having the following joint density 

n 

9v(x) = n^NP a *)> (7) 

i— 1 

where g,. (a^lpaj is the conditional factor over x t given the set of the parents denoted by pa ( . We assume 
that each conditional factor has its own separate set of variational parameters v t and v = (v t ,.... v n ). 
The objective is then to obtain a stochastic approximation for the gradient of the lower bound over each 
variational parameter ry based on the log derivative form in eq. (|4]). 

Our method is motivated by the observation that each parameter v t is rather influenced mostly by its 
corresponding latent variable Xi since v, determines the factor q Vi {xi |pa, : ). Therefore, to get information 
about the gradient of v % we should be exploring multiple possible values of x t and a rather smaller set of 
values from the remaining latent variables x\ j. Next we take this idea into the extreme where we will be 
using infinite draws from Xi (i.e. essentially an exact expectation) together with just a single sample of 
x\j. More precisely, we factorize the variational distribution as follows 


9 v(x) = g(xi|mb i )g(x\ i ), (8) 

where mb,; denotes the Markov blanket of x t . By using the log derivative trick the gradient over v t can be 
written as 


V„,E(v) = E, (x) [/(x)V Wi logg^OilpaJ], 

= E g ( x ^,) [lEg^imh^ [/(x)V Vj log ^(xjjpaj]] , 


(9) 


where in the second expression we used the law of iterated expectations. Then, an unbiased stochastic 
gradient, say at the t-th iteration of an optimization algorithm, can be obtained by drawing a single sample 
x|9 from gfx,,) so that 


E 


^(xjlmb^) 


/( x \?, Xi)W Vi log q Vi (xi\pa\ 




( 10 ) 


which is the expression for the proposed stochastic gradient for the parameter ig. To get an independent 
sample x^ from g(x\,) we can simply simulate a full latent vector x^) from g v ( x ) by applying the 
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Algorithm 1 Stochastic variational inference using local expectation gradients 

Input: /(x), <? v (x). 

Initialize v^, t = 0. 

repeat 

Set t — t + 1. 

Draw pivot sample x^ ~ g v (x). 


for i = 1 to n do 



dv > = E a (.,M.«> 

/(X\* } , x i)V Vi log Qvi (^i Paf 5 ) 


Vi = Vi + r/tdvi. 



end for 

until convergence criterion is met. 



standard ancestral sampling procedure for directed graphical models fl]. Then, the sub-vector x(v is by 


construction an independent draw from the marginal g(xw). Furthermore, the sample x ,J can be thought 
of as a pivot sample that is needed to be drawn once and then it can be re-used multiple times in order to 
compute all stochastic gradients for all variational parameters v\,... ,v n according to eq. ( [TO] ). 


When the variable x, takes discrete values, the expectation under g(xj|mbf^) in eq. (10) reduces to a 
sum of terms associated with all possible values of Xj. On the other hand, when is a continuous vari¬ 
able the expectation in ([TO]) corresponds to an univariate integral that in general may not be analytically 
tractable. In this case we shall use fast numerical integration methods (e.g. Gaussian quadrature when 
q Vi {xi |paj is Gaussian). 

We shall refer to the above algorithm for providing stochastic gradients over variational parameters as 
local expectation gradients and pseudo-code of a stochastic variational inference scheme that internally 
uses this algorithm is given in Algorithm [I] Notice that Algorithm [T] corresponds to the case where 
/(x) = logp(y, x) — logg v (x) while other cases can be expressed similarly. 

In the next two sections we discuss the computational complexity of the proposed algorithm and draw 
interesting connections between local expectation gradients with Gibbs sampling (Section |3.1[ ) and the 
reparametrization approach for differentiable functions /(x) (Section |3.2[). 


3.1 Time complexity and connection with Gibbs sampling 

In this section we discuss computational issues when running the proposed algorithm and we point out 
similarities and difference that has with Gibbs sampling. 
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Let us assume that time complexity is dominated by function evaluations of /(x). We further assume 
that this function neither factorizes into a sum of local terms, where each term depends on a subset of 
variables, nor allows savings by performing incremental updates of statistics computed during interme¬ 
diate steps when evaluating /(x). Based on this the complexity per iteration is 0(nK) where K is the 
maximum number of evaluations of /(x) needed when estimating the gradient for each v t . If each x r 
takes K discrete values, the exact number of evaluations will be n(K — 1) + 1 where the “plus one” 
comes from the evaluation of the pivot sample that needs to be performed once and then it can be 
re-used for each i — 1,..., n. Notice that this time complexity corresponds to a worst-case scenario. In 
practice, we could save a lot of computations by taking advantage of any factorization in /(x) and also 
the fact that any function evaluation is performed for inputs that are the same as the pivot sample x^ but 
having a single variable changed. Furthermore, once we have drawn the pivot sample x^ all function 
evaluations can be trivially parallelized. 

There is an interesting connection between local expectation gradients and Gibbs sampling. In par¬ 
ticular, carrying out Gibbs sampling in the variational distribution in eq. ([7]) requires iteratively sampling 
from each conditional q(xi\mbi), for i = 1,..., n. Clearly, the same conditional appears also in the local 
expectation algorithm when estimating the stochastic gradients. The obvious difference is that instead 
of sampling from f/(.x ( |mb,) we now average under this distribution. Furthermore, for models having 
discrete latent variables the time complexity per iteration is the same as with Gibbs sampling with the 
important difference, however, that the algorithm of local expectation gradients is trivially parallelizable 
while Gibbs sampling is not. 

3.2 Connection with the reparametrization approach 

A very interesting property of local expectation gradients is that it is guaranteed to provide stochastic 
gradients having lower variance that the state of the art reparametrization method [OH![12], also called 
stochastic belief propagation |[9j, which is suitable for continuous spaces and differential functions /(x). 
Specifically, we will prove that this property holds for any factorized location-scale variational distribu¬ 
tion of the form 

n 

9v(x) = (11) 

2—1 
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where Vi = (//,, /i, is a location-mean parameter and £ t is a scale parameter. Given that q(zi ) is the 

base distribution based on which we can reparametrize Xi according to = //, + with z* ~ q(zi), the 
single-sample stochastic gradient over Vi is given by 




( 12 ) 


where /r is the vector of all /x ( s, similarly ^ and z(^ are vectors of C,s and zf* s, while o denotes element¬ 
wise product. The local expectation stochastic gradient from eq. ([TO]) takes the form 


q Vi (xi)f{x.^,Xi)W Vi log q Vi ( Xi)dx u 
V Vi q Vi (xi)f(^\-,Xi)dxi. 


(13) 


Now notice that + £\ % o z':. Also by interchanging the order of the gradient and integral 

operators together with using the base distribution we have 


V 0i J q Vi (xi)f(l*\i + Ai ° z ^,Xi)dxi, 

= / q(zi)f(n + £o z (t) )dzi, 

= [ q(zi)V v J(fi + £oz {t) )dzi, 


(14) 


where the final eq. ( fl4] ) is clearly an expectation of the reparametrization gradient from eq. (12). There¬ 
fore, based on the standard Rao-Blackwellization argument the variance of the stochastic gradient ob¬ 
tained by ( p~4] ) will always be lower or equal than the variance of the gradient of the reparametriza¬ 
tion method. To intuitively understand this, observe that eq. ( fTdj ) essentially says that the single-sample 
reparametrization gradient from ([12]) is just a Monte Carlo approximation to the local expectation stochas¬ 
tic gradient obtained by drawing a single sample from the base distribution, thus naturally it should have 
higher variance. In the experiments in Section[4]we typically observe that the variance provided by local 
expectation gradients is roughly one order of magnitude lower than the corresponding variance of the 
reparametrization gradients. 
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4 Experiments 


In this section, we apply local expectation gradients (LeGrad) to different types of stochastic variational 
inference problems and we compare it against the standard stochastic gradient based on the log derivative 
trick (LdGrad) described by eq. ([5]) as well as the reparametrization-based gradient (ReGrad) given by 
eq. ([6]). In Section 4.1 we consider fitting using a factorized variational Gaussian distribution a highly 


correlated multivariate Gaussian. In Section 4.2 we consider a two-class classification problem using 
two digits from the MNIST database and we approximate a Bayesian logistic regression model using 


stochastic variational inference. Finally, in Section 4.3 we consider a sigmoid belief network with one 
layer of hidden variables and we fit it to the binarized version of the MNIST digits. For this problem we 
also parametrize the variational distribution using a recognition model. 


4.1 Fitting a high dimensional Gaussian 

We start with a simple “artificial” variational inference problem where we would like to fit a factorized 
variational Gaussian distribution of the form 


9v(x) = Yl^{xi\nu^), (15) 

i=l 

to a highly correlated multivariate Gaussian of the form A/"(x|m, E). We assume that n = 100, m = 2, 

where 2 is the 100-dimensional vector of 2s. Further, the correlated matrix E was constructed from a 

1 / \ 2 _ 

kernel function so that E^ = e~^ Xi ~ Xj> + 0.1 Oij and where the inputs where placed in an uniform grid 
in [0,10]. This covariance matrix is shown in Figure[l] The smaller eigenvalue of E is roughly 0.1 so we 
expect that the optimal values for the variances if to be around 0.1 (see e.g. JU) while the optimal value 
for each is 2. 

Given that the latent vector x is continuous, to obtain the stochastic gradient for each (/ij, if) we need 
to apply numerical integration. More precisely, the stochastic gradient for LeGrad according to eq. ( fTT)| ) 
reduces to an expectation under the Gaussian J\[{xi\ni, if) and therefore we can naturally apply Gaussian 
quadrature. We used the quadrature rule having K = 5 grid point^j] so that the whole complexity of 
'Gaussian quadrature with K grid points integrates exactly polynomials up to 2 K — 1 degree. 
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LeGrad was 0(nK) = 0(500) function evaluations per iteration (see Section [XT] ). When we applied the 
standard LdGrad approach we set the number of samples in ([5]) equal to S — 500 so that the computational 
costs of LeGrad and LdGrad match exactly with one another. When using the ReGrad approach based on 
(|6]) we construct the stochastic gradient using a single sample, as this is typical among the practitioners 


that use this method, and also because we want to empirically confirm the theory from Section 3.2 which 
states that the LeGrad method should always have lower variance than ReGrad given that the latter uses 
a sample size of one. 

Figure |2fa) shows the evolution of the variance of the three alternative stochastic gradients (esti¬ 
mated by using the first variational parameter //1 and a running window of 10 previous iterations) as the 
stochastic optimization algorithm iterates. Clearly, LeGrad (red line) has the lowest variance, then comes 
ReGrad (blue line) and last is LdGrad which despite the fact that it uses 500 independent samples in the 
Monte Carlo average in eq. ([5]) suffers from high variance. Someone could ask about how many samples 
the LdGrad method requires to decrease its variance to the level of the LeGrad method. In this example, 
we have found empirically that this is achieved when LdGrad uses S = 10 4 samples; see Figure [^b). 
This shows that LeGrad is significantly better than LdGrad and the same conclusion is supported from 
the experiment in Section |4~2] where we consider a Bayesian logistic regression model. 

Furthermore, observe that the fact that LeGrad has lower variance than ReGrad is in a good accor¬ 


dance with the theoretical results of Section 3.2 Notice also that the variance of LeGrad is roughly one 
order of magnitude lower than the variance of ReGrad. 

Finally, to visualize the convergence of the different algorithms and their ability to maximize the 
lower bound in Figure |2])c) we plot the stochastic value of the lower bound computed at each iteration 
by drawing a single sample from the variational distribution. Clearly, the stochastic value of the bound 
can allow us to quantify convergence and it could also be used as a diagnostic of high variance problems. 
From Figure [2|c) we can observe that LdGrad makes very slow progress in maximizing the bound while 
LeGrad and ReGrad converge rapidly. 
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Figure 1: The 100 x 100 covariance matrix E used in the experiment in Section 4.1 



Figure 2: The panel in (a) shows the variance of the gradient for the variational parameter pi when using 
LeGrad (red line), ReGrad (blue line) and LdGrad (green line). The number of samples used by LdGrad 
was S = 500. The panel in (b) shows again the variances for all methods (the red and blue lines are from 
(a)) when LdGrad uses S = 10 4 samples. The panel in (c) shows the evolution of the stochastic value of 
the lower bound (for LdGrad S = 500 was used). 


4.2 Logistic regression 


In this section we compare the three approaches in a challenging binary classification problem using 
Bayesian logistic regression. We apply the stochastic variational algorithm in order to approximate the 
posterior over the regression parameters. Specifically, given a dataset V = {z ? , y ? ,, where z m e R" 
is the input and y m e ( — 1, +1} the class label, we model the joint distribution over the observed labels 
and the parameters w by 


p{ y> w ) 


M 


n 


' Z m W > 


\m=l 


p( w), 


where a (a) is the sigmoid function and p( w) denotes a zero-mean Gaussian prior on the weights w. As 
in the previous section we will assume a variational Gaussian distribution defined as in eq. ([15]) with the 
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only notational difference that the unknowns now are model parameters, denoted by w, and not latent 
variables. 

For the above setting we considered a subset of the MNIST dataset that includes all 12660 training 
examples from the digit classes 2 and 7. We applied all three stochastic gradient methods using a setup 
analogous to the one used in the previous section. In particular, the LeGrad method uses again a Gaussian 
quadrature rule of K = 5 grid points so that the time complexity per iteration was 0(nK) = 0(785 * 5) 
and where the number 785 comes from the dimensionality of the digit images plus the bias term. To 
match this with the time complexity of LdGrad, we will use a size of S — 3925 samples in the Monte 
Carlo approximation in eq. ([5]). 

Figure [3ja) displays the variance of the stochastic gradient for the LeGrad method (green line) and 
the ReGrad method (blue line). As we can observe the local expectation method has roughly one order 
of magnitude lower variance than the reparametrization approach which is in a accordance with the 
results from the previous section (see Figure [2]). In contrast to these two methods, which somehow have 
comparable variances, the LdGrad approach severely suffers from very high variance as shown in Figure 
[3](b) where the displayed values are of the order of 10 7 . Despite the fact that 3925 independent samples 
are used in the Monte Carlo approximation still LdGrad cannot make good progress when maximizing 
the lower bound. To get a sensible performance with LdGrad we will need to increase considerably the 
sample-size which is computationally very expensive. Of course, control variates can be used to reduce 
variance to some extend, but it would have been much better if this problem was not present in the first 
place. LeGrad can be viewed as a certain variant of LdGrad but has the useful property that does not 
suffer from the high variance problem. 

Finally, Figure |3jb) shows the evolution of the stochastic value of the lower bound for all three 
methods. Here, we can observe that LeGrad has significant faster and much more stable convergence 
than the other two methods. Furthermore, unlike in the example from the previous section, in this real 
dataset the ReGrad method exhibits clearly much slower convergence that the LeGrad approach. 
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Figure 3: The panel in (a) shows the variance of the gradient for the variational parameter /j i when using 
LeGrad (red line), ReGrad (blue line), while panel (b) shows the corresponding curve for LdGrad (green 
line). The number of samples used by LdGrad was S = 3925. The panel in (c) shows the evolution of 
the stochastic values of the lower bound. 


4.3 Fitting one-layer sigmoid belief net with a recognition model 


In the final example we consider a sigmoid belief network with a single hidden layer. More precisely, 
by assuming the observations are binary data vectors of the form y* e (0, l}^, such a sigmoid belief 


network assumes that each y, is generated independently according to 


D 

p(y\ w ) = 5ZII [ cr ( w rf x )] I/d - cr ( w I x )] 1-Iu p( x )» ( 16 ) 

x d= 1 


where x e (0,1 } K is a vector of hidden variables while the prior p(x) is taken to be uniform. The matrix 


W (that incorporates also a bias term) consists of the set of model parameters to be estimated by fitting the 


model to the data. In theory we could use the EM algorithm to learn the parameters W, however, such an 


approach is not feasible because at the E step we need to compute the posterior distribution p(xj|yj, W) 
over each hidden variable which clearly is intractable since each x, takes 2 A values. Therefore, we 


need to apply approximate inference and next we consider stochastic variational inference using the local 


expectation gradients algorithm. 

More precisely, by following recent trends in the literature for fitting this type of models @ 0 HI 


we assume a variational distribution parametrized by a “reverse” sigmoid network that predicts the latent 
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Figure 4: The first row shows 100 training examples and the learned model parameters W for all K = 200 
hidden variables. The second row shows the corresponding reconstructed data and the evolution of the 
stochastic value of the lower bound. 


vector Xj from the associated observation y ( : 

K 

qv(*i) = n [o-(Vfcyt)] Xi * [1 - ^(vjyi)] 1 '* 1 ", (17) 

k= 1 

where V is a matrix (that incorporates also a bias term) comprising the set of all variational parameters 
to be estimated. Often a variational distribution of the above form is referred to as the recognition model 
because allows to predict the activations of the hidden variables in unseen data y* without needing to fit 
each time a new variational distribution. 

Based on the above model we considered a set of 1000 binarized MNIST digits so that 100 examples 
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were chosen from each digit class. The application of stochastic variational inference is straightforward 
and boils down to constructing a separate lower bound for each pair (y, : , x,) having the form 


D 

Fi{V, w) = ^gv(Xj)log JJ[<7(wJxi)] w<i [l - ^(wjxi)] 1 "^ 

Xi d= 1 

K K 

- a ( v kyi) !og^( v Iy*) - - a ( v ^)) 1 °g( 1 _ a ( v k yO)- ( 18 ) 

k= 1 k= 1 


The total lower bound is then expressed as the sum of all data-specific bounds and it is maximized using 
stochastic updates to tune the forward model weights W and the recognition weights V. Specifically, 
the update we used for W was based on drawing a single sample from the full variational distribution: 
qy(xj), with i = 1 ,,n. The update for the recognition weights was carried out by computing the 
stochastic gradients according to the local expectation gradients so that for each v/, the estimate obtained 
the following form 


^ 1 Vv/. ^ ] ®ik (1 ®ik) 


i=l 


i= 1 


{l + 0) \ (\-G ik 

Z^ lo S 1 -~ I + lo S 


,d= 1 


1 + e -2/<d w I( x i\fc^ifc =1 ) 


Gik 


y 

(19) 


where a ik = cr(vjy*) and y u i is the { — 1,1} encoding of y t d- Figure |4] shows several plots that illustrate 
how the model fits the data and how the algorithm converges. More precisely, we optimized the model 
assuming K = 200 hidden variables in the hidden layer of the sigmoid network with the corresponding 
weights W shown in Figure [4} Clearly, the model is able to provide very good reconstruction of the 
training data and exhibits also a fast and very stable convergence. 

Finally, for comparison reasons we tried also to optimize the variational lower bound using the Ld- 
Grad algorithm. However, this proved to be very problematic since the algorithm was unable to make 
good progress and has severe tendency to get stuck to local maxima possibly due to the high variance 
problems. 
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5 Discussion 


We have presented a stochastic variational inference algorithm which we call local expectation gradients. 
This algorithm provides a a general framework for estimating stochastic gradients that exploits the local 
independence structure of the variational distribution in order to efficiently optimize the variational pa¬ 
rameters. We have shown that this algorithm does not suffer from high variance problems. Future work 
will concern with the further theoretical analysis of the properties of the algorithm as well as applications 
to hierarchical probabilistic models such as sigmoid belief networks with multiple layers. 
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